An efficient framework for obtaining the initial cluster centers

Clustering is an important tool for data mining since it can determine key patterns without any prior supervisory information. The initial selection of cluster centers plays a key role in the ultimate effect of clustering. More often researchers adopt the random approach for this purpose in an urge to get the centers in no time for speeding up their model. However, by doing this they sacrifice the true essence of subgroup formation and in numerous occasions ends up in achieving malicious clustering. Due to this reason we were inclined towards suggesting a qualitative approach for obtaining the initial cluster centers and also focused on attaining the well-separated clusters. Our initial contributions were an alteration to the classical K-Means algorithm in an attempt to obtain the near-optimal cluster centers. Few fresh approaches were earlier suggested by us namely, far efficient K-means (FEKM), modified center K-means (MCKM) and modified FEKM using Quickhull (MFQ) which resulted in producing the factual centers leading to excellent clusters formation. K-means, which randomly selects the centers, seem to meet its convergence slightly earlier than these methods, which is the latter’s only weakness. An incessant study was continued in this regard to minimize the computational efficiency of our methods and we came up with farthest leap center selection (FLCS). All these methods were thoroughly analyzed by considering the clustering effectiveness, correctness, homogeneity, completeness, complexity and their actual execution time of convergence. For this reason performance indices like Dunn’s Index, Davies–Bouldin’s Index, and silhouette coefficient were used, for correctness Rand measure was used, for homogeneity and completeness V-measure was used. Experimental results on versatile real world datasets, taken from UCI repository, suggested that both FEKM and FLCS obtain well-separated centers while the later converges earlier.

In every aspect of our day to day requirements it is often necessary to sensibly organize data into their relevant groups.This not only gives clarity about their whereabouts but also helps us to pick them from their respective assemblage much faster.So, the most important thing that needs to be considered is the correct grouping among them.Consequently, in order to expedite the retrieval of relevant information from a group or sub-group, numerous consistent practices have been developed, one of which is data clustering (Odell and Duran 1 ).The main goal is to create divisions for the whole data set into reasonably smaller homogenous subdivisions so that the objects present in a subgroup will be having similar characteristics with each other and reasonably differ from those present in other subgroups.
For any clustering technique to come to practice, the first and foremost step is the selection of initial cluster centers which defines the number of clusters to be created.There are various ways in which the initial cluster centers are initialized.This initialization plays a crucial role for the end result of clustering.Random selections of the initial centroids are preferred by many clustering algorithms.This adds to the simplicity of the approach and also to the computation time of the algorithm.When the initial centroids are chosen randomly, roughly no time is spent on their selection step, which lessens the overall execution time as well as the time complexity of the algorithm.However, with random initialization of centroids, different runs will produce different clustering results.Sometimes, the result will show excellent subgroups formation while in most cases the resulting clusters www.nature.com/scientificreports/Yao et al. 14 grouped the non-numeric attributes present in the dataset according to their properties and discovered the analogous similarity metrics in that order.They proposed a method for finding the initial centroids based on the dissimilarities and compactness of data.Once the centers were obtained, clustering was performed on modified inter-cluster entropy for miscellaneous data.Results concluded that since the initial centers determined were optimal so it resulted in good clustering accuracy rate (CAR).Ren et al. 15 suggested a two-step structure for scalable clustering in which the initial step determines the frame structure of data and the final step does the actual clustering.Data objects are initially placed across a 2-D grid and are clustered using different algorithms, each giving a set of partial core points.These points correspond to the dense parts of data, which form centers for center-based, modes for density-based or means for probability-based types of clustering.This method can speed-up the computation and produces robust clustering.Results have shown the usefulness of the method.
Franti et al. 16 suggested a better initialization technique that improves the clustering efficiency of K-means.When there are overlapping clusters, using farthest point heuristic, malicious clusters may be reduced from 15 to 6% and when the method was repeated for 100 times, a further reduction to 1% was noted.However, they remarked that dataset with well separated clusters depends mostly on proper initialization of centers.
Mehta, et al. 17 discussed and analyzed several proximity measures and suggested a way for choosing a proximity measure that can be used in hierarchical and partitioned clustering.They concluded that the average performance of clustering changes when diverse proximity measures were implemented.Mehta et al. 18 further researched in document clustering in text mining and proposed a method by hybridizing the statistical and semantic features.The technique uses a fewer number of features but this hybridization improves the textual clustering and provides better precision within acceptable time limit.
Shuai 19 proposed an improved feature selection and clustering framework.This model initially does the data processing and then uses the feature selection to obtain important features from the dataset.This was followed by hybridizing K-means and SOM neural network to perform the actual clustering.Finally, collaborative filtering was used to cluster datasets which constituted missing data to make sure that all samples can acquire results.Results obtained showed high accuracy in clustering and interpretability.Nie et al. 20 proposed a clustering technique where there is no need to calculate the cluster centers in each iteration.In addition, the proposed technique provides an efficient iterative re-weighted approach to solve the optimization problem and shows a faster convergence rate.
Ikotun et al. 21broadly presented a summary and taxonomy of the widely used K-means clustering algorithm and its different variants created by various researchers.The record of K-means, recent developments, different issues and challenges, and suggested potential research viewpoint are discussed.They have found most of the research work has been carried out on solving the initialization issues of K-means however, very little focus has been given on addressing the problem of mixed data type.This survey will help practitioners to work on this aspect.

Methods
One of the major drawbacks of traditional K-Means approach is the initial center selection which is done randomly.Random selection of centers may perhaps result in incorrect creation of clusters.Due to this issue, there are few suggestions mentioned in this work with an aim to minimise this limitation.In addition to this, there are suggestions for reducing the time complexity, actual computation time and the convergence criteria of the projected methods.Subsequently, all the methods are evaluated to verify how good each of them creates the subgroups.The methods for choosing the initial cluster centers are discussed below:

Method-1: K-means
This is an unsupervised learning algorithm (Mac Queen 22 ) to find K sub-groups from a given set of data, where the value of K is defined by the user.Initially, random K data points are selected as cluster centers.Each data point is assigned to its nearest cluster center to form disjoint sub-groups.Then the cluster centers are overwritten with the values obtained by taking mean of the data points present inside it.This process of reassigning data points to the nearest center and updating the cluster centers is repeated until there are no changes to the centers.The steps followed in K-means are: • Take random K data points as initial centers • Assign each data to nearest cluster center • Update cluster centers by taking mean • Repeat steps 2-3 until convergence K-Means is the simplest way for getting different sub-groups of a given input.However, taking K random centers in step-1 makes it unpredictable.Every time this algorithm is executed it produces different results.So there is no precision in the clustering outcome.Even there is a possibility that at times it may result in creation of empty clusters.
It has an indefinite time complexity, since it is computationally a NP-hard problem.
Step-2 of the procedure may repeat for an uncertain amount of time.However, if we constraint the clustering loop to a fixed value i, such that i < n, then the complexity of the algorithm will become O (i * K * n * d), where K and n are the number of clusters and data points respectively and d is the dimension of each data.Practically, the values of K and dare fixed and significantly smaller than n, hence the complexity is O (n).
To overcome the limitations of selecting the initial cluster centers randomly, Mishra et al. 23 suggested an innovative approach.The central idea of their approach is to obtain distant points as initial center which will result in disjoint, tight and precise cluster formation.The pseudo code representation of algorithm is as follows: 1. // Farthest pair is determined for di in dataset: for each ci in range (0, i): for each dt in dataset: The algorithm works as follows.Distance of each data point to all other points present in the data set are computed and the data pair which lies farthest from each other are considered as first two initial centers in step 1 and step 2. Step 4 assigns data points to their nearest center until a pre-defined threshold value is reached.In step 5 and 6 centers are updated by taking mean of the partial clusters formed in previous step.The remaining K−2 centers are computed in step 8.The process of obtaining the centers is illustrated in Fig. 1.
Due to brute force comparisons between all the data points to find the farthest pair in Step 1, the loop has to run n 2 times making its complexity Ɵ (n 2 ).Hence, the overall complexity of FEKM is Ɵ (n 2 ).The major drawback of FEKM is its worst case running time complexity which is Ɵ (n 2 ).Methods like FEKM with quadratic time complexity are feasible for small data sets.

Method-3: modified center K-means (MCKM)
Considering the quadratic time complexity of FEKM, (Mishra et al.IJISA 24 ) suggested MCKM for selecting the initial cluster centers with complexity less than Ɵ (n 2 ).The idea is to sort the data points with respect to a fixed point of reference, which is the last data as present in the data set matrix.Then divide the data set into K equal subgroups and compute the mean of each group to find initial centers.The procedure is presented as a pseudo code which is as follows: Step 2 sets the very last element present in the data set matrix as the first center which is the point of reference.In step-4, distance of each element is determined from the point of which makes its complexity O (n).
Step 5 sorts all data using a sorting technique of complexity O (n* log (n)).In step 6, the sorted list is splitted into K equal subgroups which will cause the loop to run n times making its complexity as Ɵ (n).The centers are updated and mean of each group gives the rest k-1 center in step 7 whose complexity is Ɵ (n).Thus, overall complexity of MCKM is O (n* log (n)).This method is able to provide a systematic and efficient procedure to obtain initial centers.Unlike FEKM it has a better running time complexity.However, FEKM has an upper hand over MCKM in obtaining distant initial centers which results in better cluster formation.

Hastening FEKM by constructing convex hull
FEKM uses brute force technique to find the farthest data pair and considers them as the first two initial centers.This results in quadratic time complexity of the algorithm.So, in order to reduce the complexity of FEKM, the concept of convex hull (Cormen 25 ) is used.Convex hull is the smallest convex polygon that encloses all the data points of a given data set.These points are selected in such a way that, there exist no other points which remains outside the hull.By computing the convex hull, it is only required to compare the data points which form the vertices of the hull instead of considering every data points to obtain the farthest pair.There are two approaches considered in this research to achieve the farthest centers using convex hull.They are as follows: www.nature.com/scientificreports/Farthest centers using graham scan (FCGS) Graham scan approach (Graham 26 ) maintains a stack which contains only the vertices of the hull.The procedure pushes all the data points into the stack and ultimately pops the point that is not a vertex.The distance between each vertex is determined to obtain the farthest pair to be considered as first two centers.Hence, it is not required to compare the distance between all other points.This reduces the complexity of the algorithm.The algorithm for FCGS is as follows:  Step 1 of the algorithm chooses a point p 0 with smallest y-coordinate value or the leftmost x-coordinate value if two or more points have equal y-coordinates.For this step 1 need to traverse to reach every data points which make its complexity Ɵ (n).In step 2 data points excluding p 0 are sorted as per the polar angle around p 0 in anti-clockwise order using a sorting algorithm of complexity O (n * log (n)).First three points of the sorted data set i.e. p 0 , p 1 , p 2 are pushed into a stack in step 4. In each iteration of step 5, a data point is pushed into the stack and the orientation (Cormen 25 ) formed by top three elements of the stack is checked.If the orientation is clock wise or non-left then pop( ) operation is performed.As it traverses n−3 points, step 5 has a complexity of O (n).After obtaining the stack of hull vertices, each vertex is distance-wise compared to find the farthest pair in step 7. Assuming number of elements in stack is m, the complexity of step 7 is Ɵ (m 2 ).The remaining k−2 centers are computed in step 9 following FEKM.As m will be always less than n, overall complexity of FCGS is O (n * log(n)).Therefore, this method is able to obtain initial centers which will boost clustering performance with a reduced complexity of O (n * log(n)).
However, Graham scan is designed only for data sets with two attributes.If the dataset consists of more than two attributes then the algorithm fails because there are multiple values of polar angles.Dimensionality reduction is a way but may increase the running time of the algorithm drastically.Now, one of the solutions of FCGS is by using a method which can construct the convex hull in multiple dimensions.Modified FEKM using Quick hull is a solution to this aspect.

Modified FEKM using Quickhull (MFQ)
Quickhull proposed by (Bradford Barber 27 ) may be used to construct convex hull for n-dimensional data.It computes the convex hull in a divide and conquer approach recursively.The pseudo code of algorithm is as follows:  First seven steps of the algorithm are used to construct a convex hull.In this method, a list named hull is declared which store the vertices of the convex hull.The points with maximum and minimum x coordinate values are assigned to variables max_x and min_x respectively.A line L is constructed by joining min_x and max_x.This line will divide the data sets into two parts.A point p is found which is farthest from L and is added to hull.A triangle is formed by joining p and two end points of L. The points lying inside the triangle are not considered for further construction of the hull.For the remaining points present outside the triangle, each side of the triangle is assigned as L and steps5& 6of the algorithm are repeated recursively until there are no points left outside the last triangle formed.After obtaining the convex hull, the vertices lying on it are compared with each other to find the farthest pair, which is considered as first two initial cluster centers.Finally, step8 of FEKM is repeated to obtain K−2 remaining centers.The above procedure is illustrated in Fig. 2.
Step 2 of the method costs O (n) for traversing the data set to find minimum and maximum x-coordinate values.
Step 5 computes a point farthest from line L making its complexity O (n).Step 5 and 6 is repeated recursively until no points are left outside the new triangle formed.The worst case complexity of Quickhull is O (n * log (n)) if dimension of the data is less than or equal to three.When the number of attributes is more than three then the complexity of this method increases.Next step8 of FEKM is called to determine the remaining K-2 initial centers which make its complexity O(n).So, unlike FCGS this method is able effectively operate on datasets with more number of attributes with a worst case of O (n * log (n)).However, the worst case complexity of this algorithm may exceed O (n 2 ) when we have large number of attributes in data sets.In these cases, this method will prove to be inefficient than FEKM.For this reason, we suggest a method called Farthest Leap Center Selection (FLCS) to compute the farthest data pair in less than quadratic time complexity.Due to the limitations of Graham Scan and Quick hull mentioned above, we came up with a new approach called FLCS, to solve the farthest pair problem.FLCS takes a greedy approach to solve this problem.Instead of traversing every data point, it only leaps to the next point which is at a maximum distance from the current point.The method stops leaping when the next farthest point is the same previous point.The pseudo code of algorithm is as follows: 1. set mn = mean of all data points  Initially, the mean of all data points is calculated and is assigned as data 1 and prev.The farthest point from data 1 is computed by comparing it with other data points and assigned to data 2 .Then, data 2 is assigned to data 1 and data 1 is assigned to prev.Again the farthest point from data 1 is determined and assigned to data 2 .These steps are repeated until data 2 and prev refers to the same data point.These steps are illustrated in Fig. 3.The farthest pairs data 1 and data 2 obtained are assigned as first two initial centers.To obtain remaining K−2 centers step-8 of FEKM is repeated.
Step 1 of the algorithm traverses all data points once to find their mean making its complexity Ɵ (n).
Step 5 is used to leap from the current data point to the farthest point in order to skip redundant comparisons needed to compute the farthest pair.If we assumes number of leaps is i, then step 5 iterates i * n times.However, i is much smaller than n.So, the complexity of step5 is O(n).Then step 8 of the FEKM is used to compute remaining K−2 centers making complexity of step7 of FLCS O (n).Therefore, overall complexity of FLCS is O (n).

Lemma:
The number of leaps in FLCS will always be less than the total number of data points.
Given: A set of data points S, and number of leaps I. To prove: I <|S| Proof: Let us construct a convex hull on data set S with a set of vertices V. So, there exist no data points which will be present outside the hull (Cormen 25 ).Now, there are exactly two spaces from which the method can leap.Case I: Leaping from within the convex hull: The farthest point from any point inside the convex hull is a vertex of the hull, as there exist no other points beyond the convex hull.
Case II: Leaping from any vertex of the convex hull: The farthest point from a vertex of the hull will be another vertex.
By considering the above two cases it can be concluded that, once the procedure starts leaping from the first point present inside the convex hull (the first point is obtained by taking the mean of all data points), then it

Parameters for evaluation
The quality of clusters formed after the clustering process is evaluated by using few validity indices such as DI, DBI and SC.These indices measure the goodness of clusters on basis of their inter-cluster, intra-cluster distance and similarity of instances present in their created clusters.The data sets contain the class label which is used as ground truth label to compute the clustering accuracy through Rand Index.The clustering outcomes are assessed to verify if they generate perfectly homogenous and complete subgroups.For this V measure was chosen as a parameter for evaluation.The execution time of all considered methods is recorded for each of the input data sets to practically verify their efficiency.

Validity indices
Cluster validation techniques are used to measure the quality of a cluster appropriately.Few validity indices used for this purpose are given as follows: Dunn's index (DI) DI measure (Dunn 29 ) is used to minimize the intra-cluster and maximize the inter-cluster distances.It can be defined as follows- (2) www.nature.com/scientificreports/d is any given distance function used for this purpose, and A j is a set consisting of the data points which are assigned to the ith cluster.Usually, any method which produces a larger value of DI determines that the cluster formed is compact and well-separated from other clusters.
Davies-Bouldin's index (DBI) DBI (Davies et al. 30 ) is the ratio of the sum of data currently present within a cluster to those data remaining outside it.The data present within ith cluster distribution is given by: The data which are between ith and jth partition is given by: where v i is denoted as the ith cluster center, and both q & t are numerical values that can be chosen independently of each other and (q, t) ≥ 1. |A i | is the number of elements that are present in A i. Subsequently, R i,qt is determined which is given by the equation: Finally, Davies-Bouldin's index is obtained which is specified as follows: The objective of the clustering methods should focus on obtaining a minimum value of DBI for achieving proper clustering.

Silhouette coefficient (SC)
In SC (Rousseeuw and Silhouettes 31 ), for any data point d i , initially the average distance from d i to all other data points belonging to its own cluster is determined, which is denoted as a.Then, the minimum average distance from d i to all other data points present in other clusters are determined, which is b.Then the silhouette coefficient is calculated as follows: The silhouette value s ranges between 0 and 1.If s is close to 1, it indicates that the sample is well-clustered, if silhouette value is almost equal to zero, it indicates that the sample lies equally far away from both the clusters, and if silhouette value is equal to -1, it indicates that the sample is somewhere in between the clusters.The number of cluster with maximum average silhouette width is taken as the optimal number of the clusters.

Clustering accuracy
The accuracy of a cluster can be measured by using Rand Index (Rand 32 ).Given N data points in a set D = {D 1 , D 2 , …, D N } and two clustering of them to compare C = {C 1 , C 2 , …, C K1 } and C′ = {C′ 1 , C′ 2 , …, C′ K1 }, we define Rand index as: where a is number of data pairs in D that are in the same subsets of both C and C′, b is number of data pairs in D that are in different subsets of both C and C′, c is number of data pairs in D that are in same subset of C and different subsets in C′, d is number of data pairs in D that are in different subset of C and same subsets in C′.The value for R is between 0 and 1, with 0 representing two clustering do not agree on any pair of data points and 1 representing the clustering are precisely identical.Liu 33 , Guo et al. 34 , and Zou et al. 35 investigated data collection in wireless powered underground sensor networks assisted by machine intelligence, matrix algebra in directed networks, and limited sensing and deep data mining, respectively.Shen et al. 36 , Cao et al. 37 , and Sheng et al. 38 (3) where, www.nature.com/scientificreports/respectively examined the modeling of relation paths for knowledge graph completion, optimization based on mobile data, and dataset for semantic segmentation of urban scenes.Lu et al. 39 , Li et al. 40 , and Xie et al. 41 considered multiscale feature extraction and fusion of images, patterns across mobile app usage, and a simple Monte Carlo method for estimating the chance of a cyclone impact.Recently, Liu et al. 42 , Li et al. 43 , and Fan et al. 44 developed a multi-labeled corpus of Twitter short texts, studied the long-term evolution of mobile app usage, and proposed axial data modeling via hierarchical Bayesian nonparametric models.

Homogeneity, completeness and V-measure
A cluster is said to be perfectly homogenous when it contains data points belonging to a single class label.Homogeneity reduces as data points belonging to different class labels are present in the same cluster.Similarly, a cluster is said to be perfectly complete if all the data points belonging to a class label are present in the same cluster.When the number of data points of a particular class label are distributed in different clusters, the completeness of the cluster decreases.where β can be set to favour homogeneity or completeness.If β is 1, then it favours both homogeneity and completeness equally.If β is greater than 1 then completeness is favoured and β less than 1 favours homogeneity.

Results and discussion
In order to analyze the practical performance of K-Means, FEKM, MCKM, MFQ and FLCS methods, we evaluated their results on various real world data sets [UCI Repository].The characteristics of the data set are given as follows: A diverse range of datasets are considered to evaluate the performance of the discussed methods.These data sets in Table 1 vary in dimensions, properties and classifications.Iris data set consists of three flower classes-setosa, virginica and versicolor.It has fifty instances in each class and a total of a hundred fifty instances.Each instance has four attributes-petal length & width and sepal length & width.Similarly, seed data set consists of two hundred ten instances, seven attributes and it contains wheat kernel of three classes-Kama, Rosa and Canadian.Adult data set consists of fourteen attributes-age, work class, final weight, education, marital-status, occupation, relationship, race, sex, capital gain, capital loss, hours per week and native-country.The data set is classified into two classes-person earning more than 50 k and those with less than or equal to 50 k per year.Balance data set contains six hundred twenty-five instances of a psychological experiment.Each instance has four features which include left weight, left distance, right weight and right distance.All instances divided into three classes-scale tip to right, tip to left and balanced.Haberman is the record from study of patients who survived from surgery of breast cancer which contains a record of three hundred six patients.Patients are classified into two categories-who died within five years and who survived five years or longer.It has three attributes-age of the patient at time of operation, year of operation and number of positive auxiliary nodes detected.TAE contains (11)  Homogeneity is given by h www.nature.com/scientificreports/teaching performance of one hundred fifty-one teaching assistances categorized into-low, medium and high as per their performances.Wine data set is the result of chemical analysis of wine growth in Italy from three different cultivators with one hundred and seventy eight samples.Each instance has thirteen features-alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavonoids, no flavonoid phenols, proanthocyanins color intensity, hue, OD315 of dilute wines and proline.Mushroom data set is record of eight thousand one hundred and twenty four mushroom samples with their twenty two characteristics.The data set is classified into two categories of mushroom-edible and poisonous.It was discussed earlier in Section "Parameters for evaluation"(A) that, a greater value of DI or SC and a smaller value of DBI suggests better quality of cluster formation.Tables 2, 3 and 4 contains the DI, DBI and SC scores respectively where clustering loop is restricted to twenty iterations and K = 3.From these tables it can be observed that methods FEKM, MCKM, MCQ and FLCS perform better and form quality clusters than K-Means.MCQ and FLCS shows promising performance, which can be seen from the tables.In each table the www.nature.com/scientificreports/best performing method is highlighted.Overall, it can be observed that for majority of the data sets for MCQ and FLCS performed better than other methods.The accuracy of cluster formed is the next parameter for experimental evaluation.Rand index score which was discussed in Section "Parameters for evaluation"(b), was considered for this aspect.Normally, the value of Rand Index lies between 0 and 1 and value nearer to 1 signifies better accuracy.From Table Table   16540.19350.1935Adult0.00020.000310.000280.000310.00031Haberman0.00060.00080.00080.00120.00125, the accuracy of clustering of each method can be observed on the referred data sets.MFQ and FLCS have scored better than other methods in majority of the data sets as their Rand index values are closer to 1.Only in few cases like seed, wine, glass and mushroom, FEKM and MCKM have a better Rand index value than FLCS and MCQ. Figure 4 shows the accuracy of different clustering methods performed on different datasets.
Next, an analysis was made to verify whether the random selection of centers as performed by K-Means will generate stable and accurate clusters or those using an approach where the centers are initially chosen by any innovative way.For better visualization, four datasets were preferred for this purpose viz, iris, seed, TAE and Haberman whose accuracy scores varies significantly.Each method including K-Means, FEKM, MCKM, MFQ and FLCS were executed seven times taking these four datasets as inputs.From Fig. 5a the unpredictability of K-Means can be clearly observed.For every execution there is a variable accuracy score Fig. 5b.Due to random initial centers, there is unpredictability in the formation of clusters due to which varied accuracy scores are generated Fig. 5c, d.This confirms that, in some cases when the random centers are chosen accurately, they produce well-organized clusters whereas in other cases, when the random centers are not precise they create malicious subgroups.Conversely, methods like FEKM, MCKM, MFQ and FLCS gives a stable and predictable output on each execution.
The following parameter for evaluation was based on determining the homogeneity and completeness of clusters using V-measure.Tables 6, 7 and 8 below illustrates these facts.For most of the data sets V-measure is closer to 1 or is on higher side for MFQ and FLCS.This implies that the clusters obtained for different data sets are precise.However, those using K-Means are not effective since the centers randomly chosen may not be the near optimal ones.Figure 6 presents an analysis of FLCS vs. K-Means, FEKM and MCKM respectively based on  V-measure score for different datasets.The graph indicates V-measure score of FLCS is comparatively higher than K-Means, FEKM and MCKM which suggests that the clusters obtained with FLCS are homogenous and complete.Further experiments were conducted for calculating the time taken by all the methods in determining the initial cluster centers and their convergence of clustering loop.All algorithms were executed on a machine with 5th Gen Intel ® i3 processor, 1.9 Ghz.clock speed and 4 GB RAM.Table 9 represents the actual execution time of all the algorithms for determining the initial centroids.From this table it can be seen that K-Means obtain the initial centers earlier than the others as this selection is done randomly.The average time taken for selection of cluster centers of each method is plotted in Fig. 7.The horizontal line in the graph represents the average time taken to select the initial centers of all the methods.From the graph it can be observed that FEKM and MFQ take slightly more time for finding the centers than the rest.MFQ by far takes the maximum out of them as its complexity increases with the increase in dimension of the data.MCKM and FLCS show similar performance, yet FLCS takes less time to compute the initial centers than MCKM due to its linear time complexity.The clustering loop convergence time i.e. time taken for the cluster formation by each method on the given datasets is noted in Table 10.The best performing method on particular data set is highlighted.It can be observed from the results that, FLCS has least convergence time in five out of ten data sets.MFQ performs similarly because their initial centers are same.FEKM also have smaller convergence time in few datasets.The average clustering loop convergence time of each method is plotted in Fig. 8.The horizontal line indicates the average clustering loop convergence time of all methods.The plot shows that K-Means takes more time for formation of clusters due to bad random initial centers.K-Means is the only method above the average line.Other methods decide the subgroup formation below the average line.FLCS perform faster than other methods due to its significantly distinct initial clusters which are obtained in less time.
Finally, overall execution time of each method employed on the given datasets is recorded.K-Means is the fastest in center selection but its random initial centers results in large convergence time.From Table 11 and Fig. 9 it can be seen that, FEKM and MFQ have overall larger execution time than others since much of their computation is spent on selecting the near-optimal centers.On the other hand, MCKM and FLCS show promising results.www.nature.com/scientificreports/ The performance and accuracy scores of MFQ and FLCS are more or less equal because the outcomes of the methods are always same.This is due to the fact that, both methods obtain exactly same K initial centers.However, the process of computing initial centers is totally different.From Fig. 9 difference can be clearly observed, FLCS is more efficient than MFQ in execution time.

Conclusion
The initial centre of a cluster is a decisive factor in its final formation as erroneous centroids may results in malevolent clustering.In this research, few approaches were suggested to decide the near optimal cluster centers.FEKM was proposed with an idea to obtain well separated clusters.It was quite effective with most of the datasets, only with a complexity issue which is in the higher side.In order to get a solution to this, MCKM was suggested.It improved the computational time to some extend but still was higher than K-Means which randomly selects its centers.For these reason FCGS and MFQ were used which selects the centers present only on the convex hull   www.nature.com/scientificreports/thereby, trims down the chance of considering all data points to be a candidate to form the centers.However, it was found that the clustering process was effective for datasets with two or three attributes only.Due to this reason, FLCS was suggested which is simple and effective in deciding the centers with lesser complexity.All these methods were thoroughly analyzed by considering the clustering effectiveness, correctness, homogeneity, completeness, complexity and their actual execution time of convergence.For this reason performance indices like DI, DBI, and SC index were used, for correctness Rand measure was used, for homogeneity and completeness V-measure was used.All these factors considered for testing the quality of cluster formation showed excellent results for all proposed algorithms as compared to K-Means clustering.This signifies the emergence of individual groups in which data present within the group remain at a closer proximity and the separation of data of one group to that of another is very far.Zhou et al. 45 , Cheng et al. 46 and Lu et al. 47 investigated water depth bias correction of bathymetric LiDAR point cloud data, situation-aware IoT service coordination and IoT service coordination using the event-driven SOA paradigm respectively.9][50] .Peng et al. 51 , Bao et al. 52 , Liu et al. 53 recently examined the community structure in evolution of opinion formation, limited realworld training data and robust online tensor completion for IoT streaming data recovery respectively.Liu et al. 54 worked on federated neural architecture search for medical sciences.6][57] exhibits the recent and newly development in the IoT field like next-generation wireless data center network, fusion network for transportation detection and wireless sensor networks with energy harvesting relay.This work guided us to quite a few exciting and innovative research opportunities which can be further explored viz, both the time complexity and computation time of the suggested methods can be further reduced, the concept of convex hull can still be properly used to obtain better cluster centers on versatile data sets.

Figure 1 .
Figure 1.Computation of centers (a) farthest pair p 1 -p 2 are two initial centers (b) distance of each point to its nearest center calculated (c) p 3 is selected as third center since it is at maximum distance from its nearest center.

Figure 2 .
Figure 2. Illustration of MFQ (a) Step 1-7 of the algorithm is used to construct convex hull (b) Step 8 of algorithm compares all the vertices of the hull to find farthest pair as first two initial centers (c) Step 8 of FEKM is used to find remaining K−2 centers.

Figure 3 .
Figure 3. Selection of first two initial centers by discovering the farthest pair (a) Mean of data set is assigned as data 1 and prev (b) Farthest point from data 1 is data 2 (c) Farthest point from data 2 found and new farthest point is labeled as data 2 and original data 2 is labeled as data 1 (d) data 1 is labeled as prev(e) Step (c) and (d) are repeated until prev and data 2 refers to the same data point (f) data 1 and data 2 are the farthest pair and two initial centers. https://doi.org/10.1038/s41598-023-48220-3 If a data set contains N number of data, cl different class labels, separated into K clusters and d data points belonging class c and cluster i then, where F(cl, K)=−

Figure 4 .
Figure 4. Clustering Accuracy performed on different Datasets.
Cluster data to its nearest center till threshold is reached

)
Thus, V − measure A. Rosenberg and J. Hirschberg, 2007 is given by, V =

Table 1 .
Characteristics of data sets.